Abstract

In this paper, we present a possible theoretical explanation for Benford's law. We develop a recursive relation
between the probabilities, using simple intuitive ideas. We first solve this recursion numerically and
verify that the solutions converge to Benford's law. Finally, we solve the recursion analytically to yield
Benford's law for base 2.

1 Introduction 

The leading significant digit of a random integer is one of 1, 2, ..., 9. Intuitively, it is equally likely to be any of these
nine figures. However, empirical observations and Benford's law indicate the contrary. According to the law,
the probability that a random integer, expressed in base 10, starts with the digit d is [1]

$$P_d = \log_{10}\left(1 + \frac{1}{d}\right) \tag{1}$$

for d = 1, 2, ..., 9. This law was first proposed by Newcomb in 1881 [2]. It means that a random integer is most likely
to start with 1, with a probability of 0.301, and least likely to start with 9, with a probability of 0.046. Note
that the random integer is unscaled; i.e., it can be arbitrarily large. This is the suspected reason behind the
nonuniform probabilities. On the other hand, if the random number is scaled, i.e., chosen from a bounded set,
the corresponding probabilities are obtained through a direct calculation. For instance, consider a scale of 100, i.e.,
the number is chosen from the set [0, 100); the probabilities are indeed uniform. However, if the scale were 200,
they would be nonuniform, with d = 1 acquiring a very large probability (> 1/2). In this paper, we use these scaled
probabilities to arrive at the Benford values of the unscaled probabilities. Before we proceed, we state the well
known generalizations of Benford's law.
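The scaled probabilities mentioned above can be checked by direct counting. The following is a minimal sketch, not part of the original calculation; the function name `leading_digit_fractions` is hypothetical, and 0 is skipped on the assumption that it has no leading significant digit.

```python
from math import log10

def leading_digit_fractions(scale):
    """Fraction of integers in [1, scale) whose leading decimal digit is d, for d = 1..9."""
    counts = {d: 0 for d in range(1, 10)}
    for n in range(1, scale):            # assumption: 0 is skipped, it has no leading digit
        counts[int(str(n)[0])] += 1
    total = scale - 1
    return {d: counts[d] / total for d in counts}

uniform = leading_digit_fractions(100)   # scale 100: every digit has the same share
skewed = leading_digit_fractions(200)    # scale 200: digit 1 takes more than half
benford = {d: log10(1 + 1 / d) for d in range(1, 10)}   # equation (1)
```

Comparing `uniform`, `skewed` and `benford` side by side shows how strongly the scaled values depend on the chosen scale, which is the motivation for averaging over scales in the next section.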

The law generalizes to the first two digits. The probability that a random integer starts with the digits $d_1 d_2$ is given
by

$$P_{d_1 d_2} = \log_{10}\left(1 + \frac{1}{d_2 + 10 d_1}\right) = \log_{10}\left(1 + \frac{1}{d_1 d_2}\right) \tag{2}$$

It is further generalized to an arbitrary number of significant digits, and expressed in an arbitrary base b as

$$P_{d_1 \cdots d_k} = \log_b\left(1 + \frac{1}{d_k + b\, d_{k-1} + \cdots + b^{k-1} d_1}\right) = \log_b\left(1 + \frac{1}{d_1 \cdots d_k}\right) \tag{3}$$

where $d_1 \cdots d_k$ is the number expressed in base b [4]. We shall consider the simple case of base 2. In the next section,
we present the basic idea behind the proof, supported with examples and numerical calculations. The analytical
proof is provided in section 3. We end with a brief discussion in section 4.



2 Basic idea behind the proof and numerical estimates 

Expressed in base 2, every number starts with 1. Hence we consider the first two significant digits, which are either
10 or 11. Let $P_{10}$ and $P_{11}$ be the corresponding probabilities. According to Benford's law, $P_{10} = \log_2(1 + \frac{1}{2}) = 0.5849625$
and $P_{11} = \log_2(1 + \frac{1}{3}) = 0.4150375$.

These are the unscaled probabilities. Unlike them, the scaled probabilities are easily evaluated. For instance,
consider a scale of 1000; i.e., the random integer is chosen from the set [0, 1000). Since, in this set, numbers starting
with 10 and 11 are equally populated, the corresponding probabilities are 1/2 each. This is true of any scale of the
form 100...0. Accordingly, let us denote them by $P_{10}^{10} = P_{11}^{10} = \frac{1}{2}$. The superscript indicates that the scale is of the
form 10...0. Now consider a scale of 1100. It can be verified that the probabilities are now 2/3 and 1/3. Also, this is
true of any scale of the form 110...0. Let us denote them by $P_{10}^{11} = \frac{2}{3}$ and $P_{11}^{11} = \frac{1}{3}$.

Thus, the unscaled probability $P_{10}$ is in between $P_{10}^{10}$ and $P_{10}^{11}$, and $P_{11}$ is in between $P_{11}^{10}$ and $P_{11}^{11}$:

$$P_{10} = w P_{10}^{10} + (1 - w) P_{10}^{11} \tag{4}$$
$$P_{11} = w P_{11}^{10} + (1 - w) P_{11}^{11} \tag{5}$$

where w is the weight associated with the scale being of the form 100...0. To a first order, it can be approximated
by the probability that a randomly chosen scale starts with 10, which is $P_{10}$. Thus,

$$P_{10} = P_{10}^{10} P_{10} + P_{10}^{11} P_{11} \tag{6}$$
$$P_{11} = P_{11}^{10} P_{10} + P_{11}^{11} P_{11} \tag{7}$$

This gives $P_{10} = \frac{4}{7} = 0.57142$ and $P_{11} = \frac{3}{7} = 0.42857$. These are the first order approximations. The approximation
lies in the assumption $w = P_{10}$; not all integers starting with 10 are of the form 10...0.
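The first order values can be reproduced with exact rational arithmetic. This is a small sketch of the algebra only; the variable names are hypothetical, and the scaled probabilities are the ones quoted in the text.

```python
from fractions import Fraction

# Scaled probabilities from the text: suffix s10 means a scale of the form 10...0,
# suffix s11 means a scale of the form 11...0.
P10_s10, P11_s10 = Fraction(1, 2), Fraction(1, 2)
P10_s11, P11_s11 = Fraction(2, 3), Fraction(1, 3)

# Equation (6) with P11 = 1 - P10:  P10 = P10_s10 * P10 + P10_s11 * (1 - P10)
P10 = P10_s11 / (1 - P10_s10 + P10_s11)
P11 = 1 - P10

# Equation (7) should hold automatically for the same pair of values.
residual = P11 - (P11_s10 * P10 + P11_s11 * P11)
```

Here `residual` being zero confirms that equations (6) and (7) are not independent, so the normalization $P_{10} + P_{11} = 1$ is what fixes the solution.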

To sharpen the approximation, consider the first three significant digits. Using a similar notation, we denote
the unscaled probabilities by $P_{1xy}$, where x, y = 0, 1, and the scaled probabilities by $P_{1xy}^{1\alpha\beta}$, where x, y, α, β = 0, 1. $P_{1xy}^{1\alpha\beta}$
is the probability that a random integer starts with 1xy when the scale is of the form 1αβ0...0. The equations, to
the second approximation, are

$$P_{1xy} = \sum_{\alpha\beta} P_{1xy}^{1\alpha\beta} P_{1\alpha\beta} \tag{8}$$

This is a set of four equations in four variables. Once we solve for $P_{1\alpha\beta}$, we can evaluate $P_{10}$ using $P_{10} = P_{100} + P_{101}$.
To do this, we first evaluate $P_{1xy}^{1\alpha\beta}$, the population fraction of numbers starting with 1xy in the integer
set S = [0, 1αβ0...0). This set can be broken into three chunks, $S = S_0 \cup S_1 \cup S_2$, where $S_0$, $S_1$ and $S_2$ are the
integer sets

$$S_0 = [0, 1000\cdots0)$$
$$S_1 = [1000\cdots0, 1\alpha00\cdots0)$$
$$S_2 = [1\alpha00\cdots0, 1\alpha\beta0\cdots0)$$

Note that they are disjoint. $S_0$ is the largest; $S_1$ is an enhancement over $S_0$, and $S_2$ is an enhancement over $S_1$. If
$p_0$, $p_1$ and $p_2$ are the population fractions of numbers starting with 1xy within the sets $S_0$, $S_1$ and $S_2$ respectively,
we may write

$$P_{1xy}^{1\alpha\beta} = \frac{p_0 |S_0| + p_1 |S_1| + p_2 |S_2|}{|S_0| + |S_1| + |S_2|} \tag{9}$$



where $|S_j|$ is the number of elements in $S_j$. Clearly, $|S_1| = \frac{\alpha}{2}|S_0|$ and $|S_2| = \frac{\beta}{4}|S_0|$. In $S_0$, the second and the
third digits are equally distributed, i.e., 100, 101, 110, 111 appear with equal populations. Hence $p_0 = \frac{1}{4}$. In $S_1$, all
numbers have second digit 0, and the third digit is equally distributed between 1 and 0. So, $p_1 = \frac{1}{2}\delta_{x0}$. In $S_2$, all
numbers have second digit α and third digit 0. Therefore, $p_2 = \delta_{x\alpha}\delta_{y0}$. Thus,

$$P_{1xy}^{1\alpha\beta} = \frac{1 + \alpha\,\delta_{x0} + \beta\,\delta_{x\alpha}\delta_{y0}}{4 + 2\alpha + \beta} \tag{10}$$
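Equation (10) can be compared against a brute-force count. The sketch below is illustrative only: the function names and the `pad` parameter (the number of trailing zeros in the scale) are assumptions introduced here. The counted value falls slightly short of the formula because the few integers with fewer than three binary digits match no prefix 1xy; the gap shrinks as `pad` grows.

```python
from fractions import Fraction

def formula(x, y, a, b):
    """Right-hand side of equation (10) for digits x, y and scale digits a, b."""
    num = 1 + a * (x == 0) + b * (x == a) * (y == 0)
    return Fraction(num, 4 + 2 * a + b)

def counted(x, y, a, b, pad=12):
    """Fraction of integers in [0, 1ab0...0) whose binary expansion starts with 1xy."""
    scale = int(f"1{a}{b}", 2) << pad        # binary scale: 1ab followed by `pad` zeros
    prefix = f"1{x}{y}"
    hits = sum(1 for n in range(1, scale) if bin(n)[2:].startswith(prefix))
    return Fraction(hits, scale)

# Largest gap between counting and formula over all sixteen (x, y, a, b) combinations.
worst_gap = max(
    abs(counted(x, y, a, b, pad=8) - formula(x, y, a, b))
    for x in (0, 1) for y in (0, 1) for a in (0, 1) for b in (0, 1)
)
```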



Solving equation (8) with these coefficients, the solution, after normalizing the sum to 1, is

$$P_{100} = 0.3152, \quad P_{101} = 0.2626, \quad P_{110} = 0.2251, \quad P_{111} = 0.1969 \tag{11}$$

Using $P_{100} + P_{101} = P_{10}$ and $P_{110} + P_{111} = P_{11}$, we obtain $P_{10} = 0.5778$, the second approximation. As expected,
it is closer to the Benford value, 0.5849625, than the first approximation.

Higher order approximations can be obtained by considering a larger number of digits. Considering k digits
after the first digit, the equation to be solved is a $2^k \times 2^k$ matrix equation

$$P_{1x_1 \cdots x_k} = \sum_{\alpha_1 \cdots \alpha_k} P_{1x_1 \cdots x_k}^{1\alpha_1 \cdots \alpha_k} P_{1\alpha_1 \cdots \alpha_k} \tag{12}$$

where $P_{1x_1 \cdots x_k}$ is the probability that an unscaled integer starts with $1x_1 \cdots x_k$, and the matrix element $P_{1x_1 \cdots x_k}^{1\alpha_1 \cdots \alpha_k}$
is the corresponding probability with a scale of $1\alpha_1 \cdots \alpha_k 0 \cdots 0$. This can be evaluated easily. For values of k up
to 10, the equations were solved numerically using Python. Table 1 summarizes the results. The values suggest a neat
convergence to the Benford value. Interestingly, the relative error falls exponentially, roughly halving with each
additional digit. In the next section, we shall prove the convergence analytically.



  k     P_10        Rel. err.
  1     0.571428    0.023
  2     0.577861    0.012
  3     0.581339    0.0062
  4     0.583135    0.0031
  5     0.584045    0.00156
  6     0.584503    0.00078
  7     0.584732    0.00039
  8     0.584847    0.00019
  9     0.584905    0.000097
  10    0.584933    0.000049

Table 1: Estimates of $P_{10}$ up to k = 10. Value according to Benford's law: $P_{10}$ = 0.584962
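The original Python code is not reproduced in the paper, so the following is only a sketch of one way such estimates could be obtained. It builds the matrix from the closed form derived in appendix A and finds the fixed point of equation (12) by repeated multiplication (power iteration), which is a technique chosen here for simplicity and not necessarily the one used for Table 1. The names `scaled_matrix` and `p10_estimate` are hypothetical.

```python
def scaled_matrix(k):
    """Matrix elements (1 + Q) / (2^k (1 + alpha)), indexed M[x][a] with x, a in 0 .. 2^k - 1."""
    n = 2 ** k
    M = [[0.0] * n for _ in range(n)]
    for a in range(n):
        a_bits = [(a >> (k - r)) & 1 for r in range(1, k + 1)]
        for x in range(n):
            x_bits = [(x >> (k - r)) & 1 for r in range(1, k + 1)]
            # Q: term r is nonzero when alpha_r = 1, x_r = 0 and all earlier digits agree
            q = sum(1 for r in range(k)
                    if a_bits[r] == 1 and x_bits[r] == 0 and a_bits[:r] == x_bits[:r])
            M[x][a] = (1 + q) / (n + a)      # 2^k (1 + alpha) is the integer 1 a_1 ... a_k
    return M

def p10_estimate(k, iterations=200):
    """Iterate p -> M p from a uniform start, renormalize, and sum the entries with x_1 = 0."""
    n = 2 ** k
    M = scaled_matrix(k)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(M[x][a] * p[a] for a in range(n)) for x in range(n)]
        total = sum(p)
        p = [v / total for v in p]
    return sum(p[: n // 2])                  # prefixes 10 x_2 ... x_k

estimates = {k: p10_estimate(k) for k in range(1, 6)}
```

Pure Python lists keep the sketch dependency-free, at the cost of speed; for k near 10 a numerical library would be the more practical choice.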



3 Analytical Solution 

In this section, we show that Benford's law is an exact solution to equation (12). We are to solve the equation
for $P_{1x_1 \cdots x_k}$ in the limit $k \to \infty$. The matrix elements in this equation are evaluated in appendix A:

$$P_{1x_1 \cdots x_k}^{1\alpha_1 \cdots \alpha_k} = \frac{1 + Q_{\alpha x}}{2^k (1 + \alpha)} \tag{13}$$

We are to show that the solution is logarithmic, i.e., $P_{1x_1 \cdots x_k} = \log\left[1 + \frac{1}{1x_1 \cdots x_k}\right]$. Observe that this function has a
first approximation of $\frac{1}{1x_1 \cdots x_k}$ in the large k limit. Hence, we shall first show that this is a solution in the limit of
large k. Writing $x = 0.x_1 \cdots x_k$ and $\alpha = 0.\alpha_1 \cdots \alpha_k$ as binary fractions (see appendix A), so that
$1x_1 \cdots x_k = 2^k (1 + x)$, this amounts to showing that

$$\frac{1}{1 + x} = \lim_{k \to \infty} \sum_{\alpha} \frac{1}{2^k} \frac{1 + Q_{\alpha x}}{(1 + \alpha)^2} \tag{14}$$

Here x and α are numbers between 0 and 1 with k places. In the limit $k \to \infty$, x and α are any real numbers between
0 and 1, and the sum is replaced by an integral:

$$\frac{1}{1 + x} = \int_0^1 \frac{1 + Q_{\alpha x}}{(1 + \alpha)^2}\, d\alpha \tag{15}$$

It remains to show the above relation. $Q_{\alpha x}$ is the sum of an infinite series. The integral is easily evaluated for each of
these terms, and then summed up. The details of this proof are given in appendix B.
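Before the analytical proof, relation (14) can be checked numerically at a finite k. This is a rough sketch under stated assumptions: `q` and `rhs_of_14` are hypothetical helper names, the integers `a` and `x` stand for $2^k \alpha$ and $2^k x$, and the test value of x is arbitrary. The finite sum is a Riemann sum, so agreement is expected only up to an error of order $2^{-k}$.

```python
def q(a, x, k):
    """Q_{alpha x}: sum over r of alpha_r * [x_r = 0] * [first r-1 digits of alpha and x agree]."""
    total = 0
    for r in range(1, k + 1):
        a_r = (a >> (k - r)) & 1
        x_r = (x >> (k - r)) & 1
        same_prefix = (a >> (k - r + 1)) == (x >> (k - r + 1))
        total += a_r * (x_r == 0) * same_prefix
    return total

def rhs_of_14(x, k):
    """Finite-k sum on the right-hand side of equation (14)."""
    n = 2 ** k
    return sum((1 + q(a, x, k)) / (1 + a / n) ** 2 for a in range(n)) / n

k = 12
x = 0b101100111000            # arbitrary 12-digit test value (an assumption for illustration)
lhs = 1 / (1 + x / 2 ** k)
rhs = rhs_of_14(x, k)
gap = abs(lhs - rhs)          # expected to be small, of order 2**-k
```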
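For a finite value of k, to evaluate $P_{1\beta_1 \cdots \beta_k}$, we write it as

$$P_{1\beta_1 \cdots \beta_k} = \sum_{\{\alpha_i\}} P_{1\beta_1 \cdots \beta_k \alpha_1 \cdots \alpha_l} \tag{16}$$

We have shown that in the large l limit,

$$\lim_{l \to \infty} P_{1\beta_1 \cdots \beta_k \alpha_1 \cdots \alpha_l} = \frac{1}{1\beta_1 \cdots \beta_k \alpha_1 \cdots \alpha_l} \tag{17}$$

Thus,

$$P_{1\beta_1 \cdots \beta_k} = \lim_{l \to \infty} \sum_{\{\alpha_i\}} \frac{1}{1\beta_1 \cdots \beta_k \alpha_1 \cdots \alpha_l} = \lim_{l \to \infty} \sum_{\{\alpha_i\}} \frac{1}{2^l (1\beta_1 \cdots \beta_k + \alpha)}$$

where $\alpha = 0.\alpha_1 \cdots \alpha_l$. In the limit, the sum becomes the integral $\int_0^1 \frac{d\alpha}{1\beta_1 \cdots \beta_k + \alpha} = \ln\left(1 + \frac{1}{1\beta_1 \cdots \beta_k}\right)$.
Normalizing, we obtain Benford's law

$$P_{1\beta_1 \cdots \beta_k} = \log_2\left(\frac{1 + 1\beta_1 \cdots \beta_k}{1\beta_1 \cdots \beta_k}\right) \tag{18}$$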
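As a consistency check on equation (18), the probabilities over all prefixes of a given length should sum to 1, and the prefixes beginning with 10 should recover $P_{10} = \log_2(3/2)$. A minimal sketch, with the hypothetical helper `benford_base2` keyed by the integer value of the prefix:

```python
from math import log2

def benford_base2(k):
    """Equation (18) for every binary prefix n = 1 b_1 ... b_k, keyed by the integer n."""
    return {n: log2((1 + n) / n) for n in range(2 ** k, 2 ** (k + 1))}

probs = benford_base2(5)
total = sum(probs.values())        # the logarithms telescope to log2(2)
p10 = sum(p for n, p in probs.items() if bin(n)[2:].startswith("10"))
```

Both sums telescope, which is why the normalization constant turns the natural logarithm of the previous step into a base 2 logarithm.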



4 Discussion 

So far, little light has been thrown on the counterintuitive nature of Benford's law. We have not reconstructed our
intuition so as to understand the law. The origin of the anomalous behaviour is still unclear. A strong reason why it
is counterintuitive is that the sets of numbers starting with each digit have the same cardinality, and therefore we expect
the probabilities to be the same as well. One step towards understanding it is to realise that the probabilities
measure the occurrences and not the cardinalities.

To understand it better, let $\{a_i\}$ be a sequence and $\{b_i\}$ be a subsequence of $\{a_i\}$. For instance, let $a_i = i$ and
$b_i = 2i$. $\{a_i\}$ is the sequence of positive integers and $\{b_i\}$ is the subsequence of even numbers. The probability that a
randomly chosen element in $\{a_i\}$ is also an element in $\{b_i\}$ is 1/2. Now, let $\{c_i\}$ be a subsequence of $\{b_i\}$, $c_i = 4i$,
the sequence of multiples of four. The probability that a randomly chosen element in $\{a_i\}$ is also an element in $\{c_i\}$
is 1/4. Even though $\{b_i\}$ and $\{c_i\}$ have the same cardinalities, and can be mapped to each other, the probabilities
are not equal. In fact, the sequence $\{a_i\}$ can be rearranged such that every alternate term is an element of $\{c_i\}$:

$$\{a'_i\}: 1, 4, 2, 8, 3, 12, 5, 16, \cdots$$

This sequence $\{a'_i\}$ is a rearrangement of $\{a_i\}$. The probability that a random element belongs to $\{c_i\}$ is now 1/2.
Hence, this probability is unrelated to the cardinality; instead, it is a measure of the frequency of occurrence of the
elements of $\{c_i\}$ in the parent sequence $\{a_i\}$. Hence, it changes on rearranging the parent sequence.
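The dependence of this frequency on the ordering can be illustrated with a short sketch. The generator `rearranged` and the helper `fraction_of_fours` are hypothetical names, and the probability is approximated by the fraction of multiples of four among the first terms of each sequence.

```python
from itertools import count, islice

def rearranged():
    """Yield 1, 4, 2, 8, 3, 12, 5, 16, ...: non-multiples of four interleaved with multiples."""
    others = (n for n in count(1) if n % 4 != 0)
    fours = count(4, 4)
    while True:
        yield next(others)
        yield next(fours)

def fraction_of_fours(seq, length):
    """Fraction of the first `length` terms of `seq` that are multiples of four."""
    terms = list(islice(seq, length))
    return sum(1 for n in terms if n % 4 == 0) / length

natural = fraction_of_fours(count(1), 10_000)         # natural ordering of the integers
shuffled = fraction_of_fours(rearranged(), 10_000)    # rearranged ordering
```

Both sequences contain exactly the same elements, yet `natural` and `shuffled` differ, which is the point made in the text.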

In the above examples, all the occurrences were periodic. Thus, even though the sequences were infinite, due to
the periodicity, the calculation of the probability was as simple as it is in the case of a finite set. However, in a Benford
sequence, there is no such periodicity, and therefore the calculation is nontrivial. In this paper, we have outlined
a possible analytical explanation for Benford's law for base 2. It is very likely that a similar strategy can yield the
law for any base. Therefore, further work in this direction is expected to be fruitful.



5 Appendices 

5.1 Appendix A: Evaluating the matrix elements 

In this appendix, we evaluate the coefficients $P_{1x_1 \cdots x_k}^{1\alpha_1 \cdots \alpha_k}$. Each is the population fraction of numbers starting with
$1x_1 \cdots x_k$ in the set $S = [0, 1\alpha_1 \cdots \alpha_k 000 \cdots 0)$. We shall use the same strategy again: break this set into disjoint
chunks.

$$S = [0, 100\cdots0) \cup [100\cdots0, 1\alpha_1 0\cdots0) \cup \cdots \cup [1\alpha_1 \cdots \alpha_{k-1} 00\cdots0, 1\alpha_1 \cdots \alpha_k 0\cdots0)$$

Defining the sets

$$S_0 = [0, 100\cdots0) \quad \text{and} \quad S_r = [1\alpha_1 \cdots \alpha_{r-1} 00\cdots0, 1\alpha_1 \cdots \alpha_{r-1}\alpha_r 0\cdots0)$$

we may write

$$S = S_0 \cup S_1 \cup \cdots \cup S_k$$

Writing $p_r$ for the population fraction of numbers starting with $1x_1 \cdots x_k$ in the set $S_r$, and $|S_r|$ for the size of $S_r$, we may write

$$P_{1x_1 \cdots x_k}^{1\alpha_1 \cdots \alpha_k} = \frac{p_0 |S_0| + p_1 |S_1| + \cdots + p_k |S_k|}{|S_0| + |S_1| + \cdots + |S_k|}$$

since the sets are disjoint. Clearly, $|S_r| = \frac{\alpha_r}{2^r}|S_0|$. In $S_0$, all numbers are equally populated; thus, $p_0 = \frac{1}{2^k}$.
In the set $S_r$, all numbers have the first r - 1 digits equal to $\alpha_1 \cdots \alpha_{r-1}$ respectively, and the r-th digit is zero. The
remaining k - r digits are 0 or 1 with a probability of 1/2 each. Thus,

$$p_r = \delta_{\alpha_1 x_1} \delta_{\alpha_2 x_2} \cdots \delta_{\alpha_{r-1} x_{r-1}} \delta_{0 x_r} \cdot \frac{1}{2^{k-r}}$$

Therefore,

$$P_{1x_1 \cdots x_k}^{1\alpha_1 \cdots \alpha_k} = \frac{1 + \alpha_1 \delta_{0x_1} + \alpha_2 \delta_{0x_2}\delta_{\alpha_1 x_1} + \cdots + \alpha_k \delta_{0x_k}\delta_{\alpha_{k-1} x_{k-1}} \cdots \delta_{\alpha_1 x_1}}{1\alpha_1 \cdots \alpha_k}$$

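where $1\alpha_1 \cdots \alpha_k = 2^k + \alpha_1 2^{k-1} + \cdots + \alpha_k$. We can express it conveniently in a better notation. Let us define

$$\alpha = \frac{\alpha_1}{2} + \frac{\alpha_2}{2^2} + \cdots + \frac{\alpha_k}{2^k} \quad \text{and} \quad x = \frac{x_1}{2} + \frac{x_2}{2^2} + \cdots + \frac{x_k}{2^k}$$

α and x are numbers between 0 and 1 with k places. In this notation,

$$P_{1x_1 \cdots x_k}^{1\alpha_1 \cdots \alpha_k} = \frac{1 + \alpha_1 \delta_{0x_1} + \alpha_2 \delta_{0x_2}\delta_{\alpha_1 x_1} + \cdots + \alpha_k \delta_{0x_k}\delta_{\alpha_{k-1} x_{k-1}} \cdots \delta_{\alpha_1 x_1}}{2^k (1 + \alpha)}$$

Also, for brevity, define

$$Q_{\alpha x} = \alpha_1 \delta_{0x_1} + \alpha_2 \delta_{0x_2}\delta_{\alpha_1 x_1} + \cdots + \alpha_k \delta_{0x_k}\delta_{\alpha_{k-1} x_{k-1}} \cdots \delta_{\alpha_1 x_1}$$

so that

$$P_{1x_1 \cdots x_k}^{1\alpha_1 \cdots \alpha_k} = \frac{1 + Q_{\alpha x}}{2^k (1 + \alpha)}$$

5.2 Appendix B: Analytical Solution

In this appendix, we show that

$$\frac{1}{1 + x} = \int_0^1 \frac{1 + Q_{\alpha x}}{(1 + \alpha)^2}\, d\alpha$$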

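Note that the first term, after performing the integral, is 1/2. For convenience, let us make the substitution t = 1 - x,
$t_r = 1 - x_r$, so that $\delta_{0x_r} = t_r$. The integral corresponding to the r-th term in $Q_{\alpha x}$ is given by

$$\int_0^1 \frac{t_r\, \alpha_r\, \delta_{\alpha_{r-1} x_{r-1}} \cdots \delta_{\alpha_1 x_1}}{(1 + \alpha)^2}\, d\alpha$$

$t_r$ can be taken out. The delta terms inside fix the first r places of α: $\alpha_i = x_i = 1 - t_i$ up to i = r - 1, and $\alpha_r = 1$.
Thus, the integral can be written as

$$t_r \int_{a_r}^{b_r} \frac{d\alpha}{(1 + \alpha)^2} = t_r \left(\frac{1}{1 + a_r} - \frac{1}{1 + b_r}\right)$$

where $[a_r, b_r]$ is the range in which none of the deltas inside are zero. This range is given by

$$a_r = 0.x_1 x_2 \cdots x_{r-1} 1 = x^{[r-1]} + \frac{1}{2^r} = 1 - t^{[r-1]} - \frac{1}{2^r}$$

and

$$b_r = 0.x_1 x_2 \cdots x_{r-1} 1111\cdots = 1 - t^{[r-1]}$$

where $t^{[r]}$ is the approximation of t up to r places:

$$t^{[r]} = \frac{t_1}{2} + \frac{t_2}{2^2} + \frac{t_3}{2^3} + \cdots + \frac{t_r}{2^r}$$

Thus, the integral corresponding to the r-th term in $Q_{\alpha x}$ is

$$\frac{t_r}{2^r} \cdot \frac{1}{(2 - t^{[r-1]})(2 - t^{[r-1]} - \frac{1}{2^r})}$$

Thus, summing up, we obtain

$$\lim_{k \to \infty} \sum_{\alpha} \frac{1}{2^k} \frac{1 + Q_{\alpha x}}{(1 + \alpha)^2} = \frac{1}{2} + \sum_{r} \frac{t_r}{2^r} \cdot \frac{1}{(2 - t^{[r-1]})(2 - t^{[r-1]} - \frac{1}{2^r})}$$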

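Next we show that the series on the RHS sums up to $\frac{1}{2 - t}$, or $\frac{1}{1 + x}$. Consider

$$\frac{1}{2 - t^{[r+1]}} - \frac{1}{2 - t^{[r]}} = \frac{t^{[r+1]} - t^{[r]}}{(2 - t^{[r]})(2 - t^{[r+1]})} = \frac{t_{r+1}}{2^{r+1}} \cdot \frac{1}{(2 - t^{[r]})(2 - t^{[r]} - \frac{t_{r+1}}{2^{r+1}})}$$

Using the above repeatedly, we may expand $\frac{1}{2 - t^{[r]}}$ as

$$\frac{1}{2 - t^{[r]}} = \frac{1}{2} + \frac{t_1}{2} \cdot \frac{1}{2(2 - \frac{t_1}{2})} + \frac{t_2}{2^2} \cdot \frac{1}{(2 - \frac{t_1}{2})(2 - \frac{t_1}{2} - \frac{t_2}{2^2})} + \cdots + \frac{t_r}{2^r} \cdot \frac{1}{(2 - t^{[r-1]})(2 - t^{[r-1]} - \frac{t_r}{2^r})}$$

Further, since $t_r$ can take only two values, 0 and 1, we may write

$$\frac{t_r}{2^r} \cdot \frac{1}{(2 - t^{[r-1]})(2 - t^{[r-1]} - \frac{t_r}{2^r})} = \frac{t_r}{2^r} \cdot \frac{1}{(2 - t^{[r-1]})(2 - t^{[r-1]} - \frac{1}{2^r})}$$

Thus,

$$\frac{1}{2 - t^{[r]}} = \frac{1}{2} + \frac{t_1}{2} \cdot \frac{1}{2(2 - \frac{1}{2})} + \frac{t_2}{2^2} \cdot \frac{1}{(2 - \frac{t_1}{2})(2 - \frac{t_1}{2} - \frac{1}{2^2})} + \cdots + \frac{t_r}{2^r} \cdot \frac{1}{(2 - t^{[r-1]})(2 - t^{[r-1]} - \frac{1}{2^r})}$$

And continuing the series,

$$\frac{1}{2 - t} = \frac{1}{2} + \sum_{r=1}^{\infty} \frac{t_r}{2^r} \cdot \frac{1}{(2 - t^{[r-1]})(2 - t^{[r-1]} - \frac{1}{2^r})}$$

Thus,

$$\int_0^1 \frac{1 + Q_{\alpha x}}{(1 + \alpha)^2}\, d\alpha = \frac{1}{2 - t} = \frac{1}{1 + x}$$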
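The telescoping identity used in this appendix can be spot-checked numerically for a finite string of binary digits. This is only an illustrative sketch; the function name `series` and the random choice of digits are assumptions made here.

```python
import random

def series(t_digits):
    """Return (1/2 + sum_r (t_r / 2^r) / ((2 - t^[r-1]) (2 - t^[r-1] - 1/2^r)), t)."""
    total = 0.5
    t_partial = 0.0                      # t^[r-1], starting from t^[0] = 0
    for r, t_r in enumerate(t_digits, start=1):
        step = 1.0 / 2 ** r
        total += t_r * step / ((2 - t_partial) * (2 - t_partial - step))
        t_partial += t_r * step          # becomes t^[r]
    return total, t_partial

random.seed(0)
digits = [random.randint(0, 1) for _ in range(40)]
value, t = series(digits)
closed_form = 1 / (2 - t)                # should agree with `value` up to rounding
```

With x = 1 - t, the closed form is 1/(1 + x), the value required by equation (15).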